Skip to content

AWS EC2 Discovery: discover in all enabled regions#61337

Merged
marcoandredinis merged 7 commits intomasterfrom
marco/discover-allregions
Dec 2, 2025
Merged

AWS EC2 Discovery: discover in all enabled regions#61337
marcoandredinis merged 7 commits intomasterfrom
marco/discover-allregions

Conversation

@marcoandredinis
Copy link
Copy Markdown
Contributor

@marcoandredinis marcoandredinis commented Nov 13, 2025

AWS EC2 Server discovery requires a list of regions, where the service is going to look for EC2 instances.

This PR changes the matcher so that it accepts a wildcard value for the regions field.

This will call account.ListRegions AWS API, which returns all the regions enabled for the current account.

Docs will be updated afterwards.

Demo:

image
> tctl get discovery_config | yq .spec
aws:
  - install:
      enroll_mode: 1
      install_teleport: true
      join_method: iam
      join_token: teleport-ui-iam-1607154252
      script_name: default-installer
      sshd_config: /etc/ssh/sshd_config
    integration: teleportdev
    regions:
      - '*'
    ssm:
      document_name: AWS-RunShellScript
    tags:
      Discover: YesPlease
    types:
      - ec2
discovery_group: aws-prod

changelog: Added support for discovering EC2 instances in all regions, without enumerating them. Requires access to account.ListRegions in the IAM Role assumed by the Discovery Service.

Comment thread api/types/matchers_aws.go
Comment on lines -148 to +170
return trace.BadParameter("discovery service requires at least one region")
m.Regions = []string{Wildcard}
}

for _, region := range m.Regions {
if region == Wildcard {
if len(m.Regions) > 1 {
return trace.BadParameter("when using %q as region, no other regions can be specified", Wildcard)
}

// Only EC2 supports discovering regions.
// TODO(marco): add support for other resources types.
if len(m.Types) > 1 || m.Types[0] != AWSMatcherEC2 {
return trace.BadParameter("only EC2 resource discovery supports discovering all regions, " +
"enumerate all the regions or create a separate matcher for discovering EC2 resources")
}
break
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What implications does changing the CheckAndSetDefaults behavior have on backward compatibility?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change relaxes the validation: * is now a valid value for the regions array.

Upgrading
Not an issue, because everything that was valid before, is also valid now.

Downgrading
Setting a DiscoveryConfig with a * region, and then downgrading will emit a warning log:

WARN  skipping resource due to unmarshal error error:[

Mixed: Auth/Proxy vX, DiscoveryService vX-1 (X-1 denotes the previous version, not necessarily the previous major)
This will result in the same behavior as above (warn log message)

So, in both situations the DiscoveryConfig is ignored.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the discovery config is ignored nothing new will be discovered until the * is removed or the discovery service is upgraded correct?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
Upgrade, remove * and be explicit about which regions they want to discover.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this also render the discovery service cache permanently unhealthy?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to "downgrade" the resource from the Auth server for older clients. I'm not quite sure what we would substitute the wildcard with. I don't know that we want Auth trying to enumerate all regions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only we don't want to, Auth might not even have access to call the API.

We could set the region to aws-global sentinel value when the client is less than vXYZ. The cache would become healthy on the outdated Discovery Service client.
We would get the error down the line (eg, when trying to call ec:DescribeInstances)

So we would still get a failure, but just for that single DiscoveryConfig resource.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this last option makes sense, you cannot use new features on older versions, but this should not result in the teleport process being completely broken (e.g. if I run both ssh and discovery service on the host, ssh should continue to work

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this last option makes sense

Are you referring to this option?
Detecting old clients and sending the aws-global region when the wildcard is set

I'm not only concerned about running two separate services, but having two discovery configs being consumed by a single Discovery Service.
If one of them has the regions: ["*"], both set of matchers will fail.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just implemented and tested that approach.
When clients are older than 18.4.2, it will replace the "*" with aws-global, which ensures the cache becomes healthy.
It does not mean it can discover the regions, but at least it's a soft error which does not impact the other DiscoveryConfigs being processed, neither other teleport services (eg, SSH Service).

Comment thread go.mod Outdated
Comment thread lib/srv/server/ec2_watcher.go Outdated
@marcoandredinis marcoandredinis force-pushed the marco/discover-allregions branch from 1c46228 to 86f627d Compare November 14, 2025 15:10
@marcoandredinis marcoandredinis force-pushed the marco/discover-allregions branch 3 times, most recently from 49e8d69 to 066cc98 Compare November 19, 2025 11:48
@marcoandredinis
Copy link
Copy Markdown
Contributor Author

@r0mant @zmb3 @hugoShaka @tigrato @strideynet @bernardjkim @rosstimothy Can you please take a look when you get a chance?

Comment thread api/types/matchers_aws.go Outdated
Comment on lines +148 to +153
return trace.BadParameter("discovery service requires at least one region")
m.Regions = []string{Wildcard}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we still require a region to be specified instead of defaulting to wildcard?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From UX pov, I think the less options you need to set the better.

We are defaulting the tags matcher here, so that when nothing is set it will default to wildcard:wildcard.
I was following what we did for the tags, that's why I added it for the region.

I can remove the m.Regions = []string{Wildcard} when it's not set, and treat the empty regions list as fetch everything in the discovery service.

As an example, I think both configs should result in the same behavior:

aws:
  - integration: teleportdev
    tags:
      "*": "*"
    types:
      - ec2
    regions: ["*"]
aws:
  - integration: teleportdev
    types:
      - ec2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this have performance/stability implications? How bad is it if you run the discovery service on many regions with thousands of resources?

Copy link
Copy Markdown
Contributor

@rosstimothy rosstimothy Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it might be best to explicitly reject the config when no regions are specified. Consider an example from the teleport file config: we enable Auth and Proxy by default to make UX easy, but that in reality is a footgun and has caused customers to unknowingly have hundreds of Auth and Proxy services because they didn't explicitly disable them in their agent configs.

We could update the error message to hint that wildcards are an available option.

Copy link
Copy Markdown
Contributor Author

@marcoandredinis marcoandredinis Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is a fair comparison.
I see regions as a filter, so if I don't want to filter, I just expect the filter to not be there.

Another data point is that:

  • azure and gcp matchers support empty list of regions, and that translates to matching in all regions / locations
  • I see regions as a filter, same way we have for labels, which also translate empty map into "*":"*"

I'll still revert that check, and focus on the cache issue instead.

Comment thread api/types/matchers_aws.go
Comment on lines -148 to +170
return trace.BadParameter("discovery service requires at least one region")
m.Regions = []string{Wildcard}
}

for _, region := range m.Regions {
if region == Wildcard {
if len(m.Regions) > 1 {
return trace.BadParameter("when using %q as region, no other regions can be specified", Wildcard)
}

// Only EC2 supports discovering regions.
// TODO(marco): add support for other resources types.
if len(m.Types) > 1 || m.Types[0] != AWSMatcherEC2 {
return trace.BadParameter("only EC2 resource discovery supports discovering all regions, " +
"enumerate all the regions or create a separate matcher for discovering EC2 resources")
}
break
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this also render the discovery service cache permanently unhealthy?

}

// GetInstances fetches all EC2 instances matching configured filters.
func (f *ec2InstanceFetcher) GetInstances(ctx context.Context, rotation bool) ([]Instances, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disclaimer: Not something that needs to be addressed in this PR but something we should consider going forward to allow the discovery service to scale. This API could lead to OOMing the discovery service if a user has a massive amount of matching instances. It doesn't look like the consumer of these instances requires the entire set to perform its duties. If that is the case we could leverage iterators here to handle a page of instances at a time on demand.

Suggested change
func (f *ec2InstanceFetcher) GetInstances(ctx context.Context, rotation bool) ([]Instances, error) {
func (f *ec2InstanceFetcher) Instances(ctx context.Context, rotation bool) iter.Seq2[[]Instances, error]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we probably need to change this on all 3 implementations - aws, gcp and azure.

@marcoandredinis marcoandredinis force-pushed the marco/discover-allregions branch from 066cc98 to a16dba6 Compare November 20, 2025 17:38
Copy link
Copy Markdown
Contributor

@hugoShaka hugoShaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm once the downgrade scenario doesn't break the cache

Comment thread api/types/matchers_aws.go Outdated
Comment on lines +148 to +153
return trace.BadParameter("discovery service requires at least one region")
m.Regions = []string{Wildcard}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this have performance/stability implications? How bad is it if you run the discovery service on many regions with thousands of resources?

Comment thread api/types/matchers_aws.go
Comment on lines -148 to +170
return trace.BadParameter("discovery service requires at least one region")
m.Regions = []string{Wildcard}
}

for _, region := range m.Regions {
if region == Wildcard {
if len(m.Regions) > 1 {
return trace.BadParameter("when using %q as region, no other regions can be specified", Wildcard)
}

// Only EC2 supports discovering regions.
// TODO(marco): add support for other resources types.
if len(m.Types) > 1 || m.Types[0] != AWSMatcherEC2 {
return trace.BadParameter("only EC2 resource discovery supports discovering all regions, " +
"enumerate all the regions or create a separate matcher for discovering EC2 resources")
}
break
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this last option makes sense, you cannot use new features on older versions, but this should not result in the teleport process being completely broken (e.g. if I run both ssh and discovery service on the host, ssh should continue to work

Comment thread api/types/matchers_aws.go Outdated
Comment on lines +164 to +168
if len(m.Types) > 1 || m.Types[0] != AWSMatcherEC2 {
return trace.BadParameter("only EC2 resource discovery supports discovering all regions, " +
"enumerate all the regions or create a separate matcher for discovering EC2 resources")
}
break
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this out of the CheckAndSetDefaults and validate this property when creating/updating the resource. This might save us a big headache when we'll need to add region discovery to other things later and older versions will get angry.

(this might not be a big issue with the cache changes anymore)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will log the failure in DiscoveryService instead, during processing.

This whole validation is very problematic 😢
I intend to change validations of other fields, so I'll get the same cache issue again.

Can I just remove the CheckAndSetDefaults when doing Un/MarshalDiscoveryConfig calls?

if err := discoveryConfig.CheckAndSetDefaults(); err != nil {

Comment thread lib/srv/server/ec2_watcher.go
@marcoandredinis marcoandredinis force-pushed the marco/discover-allregions branch 2 times, most recently from ec9b86b to 288a3a2 Compare November 26, 2025 16:24
@marcoandredinis marcoandredinis force-pushed the marco/discover-allregions branch from 288a3a2 to 7703df2 Compare November 28, 2025 18:03
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from bernardjkim December 1, 2025 09:58
@marcoandredinis marcoandredinis force-pushed the marco/discover-allregions branch from 7703df2 to 1e02293 Compare December 2, 2025 12:16
@marcoandredinis marcoandredinis added this pull request to the merge queue Dec 2, 2025
Merged via the queue into master with commit 14706fe Dec 2, 2025
46 checks passed
@marcoandredinis marcoandredinis deleted the marco/discover-allregions branch December 2, 2025 13:33
@backport-bot-workflows
Copy link
Copy Markdown
Contributor

@marcoandredinis See the table below for backport results.

Branch Result
branch/v18 Failed

marcoandredinis added a commit that referenced this pull request Dec 2, 2025
* AWS EC2 Discovery: discover in all enabled regions

* use wildcard

* require at least one region

* downgrade discovery service for old clients

* move region x type check to runtime

* add test back

* fix expected version
github-merge-queue bot pushed a commit that referenced this pull request Dec 2, 2025
* AWS EC2 Discovery: discover in all enabled regions

* use wildcard

* require at least one region

* downgrade discovery service for old clients

* move region x type check to runtime

* add test back

* fix expected version
github-merge-queue bot pushed a commit that referenced this pull request Dec 3, 2025
* AWS EC2 Discovery: discover in all enabled regions

* use wildcard

* require at least one region

* downgrade discovery service for old clients

* move region x type check to runtime

* add test back

* fix expected version
21KennethTran pushed a commit that referenced this pull request Jan 6, 2026
* AWS EC2 Discovery: discover in all enabled regions

* use wildcard

* require at least one region

* downgrade discovery service for old clients

* move region x type check to runtime

* add test back

* fix expected version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants